{
 "cells": [
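  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "| [05_spider/01_爬虫介绍.ipynb](https://github.com/shibing624/python-tutorial/blob/master/05_spider/01_爬虫介绍.ipynb)  | Introduction to web crawlers in Python  |[Open In Colab](https://colab.research.google.com/github/shibing624/python-tutorial/blob/master/05_spider/01_爬虫介绍.ipynb) |\n",
    "\n",
    "# Introduction to Web Crawlers\n",
    "\n",
    "## What Is a Web Crawler\n",
    "\n",
    "A web crawler, formerly often called a web spider, is a bot program (or script) that automatically browses the World Wide Web and collects information according to a set of rules. Crawlers have long been widely used by internet search engines. Anyone who has used a browser knows that a web page contains hyperlinks in addition to the text meant for human readers. A crawler follows those hyperlinks to keep discovering other pages on the web. Because this collection process resembles a spider roaming across a web, the program came to be called a web crawler or web spider.\n",
    "\n",
    "## Where Crawlers Are Used\n",
    "\n",
    "For most companies, getting timely industry data is an important part of staying in business, yet many companies lack such data from the outset. Using crawlers appropriately to collect data and extract commercially valuable information from it can therefore matter a great deal.\n",
    "\n",
    "**Collecting data from the web is one of the areas where Python excels.**\n",
    "\n",
    "What happens behind the scenes when we type a URL into the browser and press Enter? For example, entering [https://www.baidu.com/](https://www.baidu.com/) brings up the Baidu home page.\n",
    "\n",
    "Put simply, the process consists of four steps:\n",
    "\n",
    "1. Look up the IP address that the domain name maps to\n",
    "2. Send a request to the server at that IP address\n",
    "3. The server responds to the request and sends back the page content\n",
    "4. The browser parses the page content\n",
    "\n",
    "Put simply, a web crawler reproduces what the browser does: given a URL, it returns the data the user needs directly, with no need to operate a browser by hand step by step.\n",
    "\n",
    "\n",
    "A web crawler has three main parts: **fetching**, **parsing**, and **storing**.\n",
    "\n",
    "# Fetching"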
   ]
  },
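  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Douban Crawler\n",
    "Scrape movie titles from Douban's Top 250 list. The code below requests the first two pages (25 titles each):"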
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "肖申克的救赎\n",
      "霸王别姬\n",
      "阿甘正传\n",
      "这个杀手不太冷\n",
      "泰坦尼克号\n",
      "美丽人生\n",
      "千与千寻\n",
      "辛德勒的名单\n",
      "盗梦空间\n",
      "忠犬八公的故事\n",
      "星际穿越\n",
      "楚门的世界\n",
      "海上钢琴师\n",
      "三傻大闹宝莱坞\n",
      "机器人总动员\n",
      "放牛班的春天\n",
      "无间道\n",
      "疯狂动物城\n",
      "大话西游之大圣娶亲\n",
      "熔炉\n",
      "教父\n",
      "当幸福来敲门\n",
      "龙猫\n",
      "怦然心动\n",
      "控方证人\n",
      "触不可及\n",
      "末代皇帝\n",
      "蝙蝠侠：黑暗骑士\n",
      "寻梦环游记\n",
      "活着\n",
      "指环王3：王者无敌\n",
      "哈利·波特与魔法石\n",
      "乱世佳人\n",
      "何以为家\n",
      "素媛\n",
      "飞屋环游记\n",
      "摔跤吧！爸爸\n",
      "十二怒汉\n",
      "哈尔的移动城堡\n",
      "少年派的奇幻漂流\n",
      "我不是药神\n",
      "鬼子来了\n",
      "大话西游之月光宝盒\n",
      "天空之城\n",
      "天堂电影院\n",
      "闻香识女人\n",
      "指环王2：双塔奇兵\n",
      "罗马假日\n",
      "猫鼠游戏\n",
      "辩护人\n"
     ]
    }
   ],
   "source": [
    "import random\n",
    "import time\n",
    "\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# Each page lists 25 movies; fetch the first two pages (50 titles).\n",
    "for page in range(2):\n",
    "    resp = requests.get(\n",
    "        url=f'https://movie.douban.com/top250?start={25 * page}',\n",
    "        headers={'User-Agent': 'BaiduSpider'}\n",
    "    )\n",
    "    soup = BeautifulSoup(resp.text, \"lxml\")\n",
    "    # Pick span.title elements that are the first child of a link.\n",
    "    for elem in soup.select('a > span.title:nth-child(1)'):\n",
    "        print(elem.text)\n",
    "    # Pause for a random 0-5 seconds so requests are not sent too quickly.\n",
    "    time.sleep(random.random() * 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Sohu Sports Crawler\n",
    "\n",
    "A crawler that collects NBA news titles and links from Sohu Sports:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "直播 http://data.sports.sohu.com/nba/nba_schedule_by_day.html\n",
      "赛程 http://data.sports.sohu.com/nba/nba_schedule_by_month.php?\n",
      "季后赛 http://sports.sohu.com/s/nba/playoffs\n",
      "排名 http://data.sports.sohu.com/nba/nba_teams_rank.html\n",
      "球队 http://data.sports.sohu.com/nba/nba_teams.html\n",
      "球员 http://data.sports.sohu.com/nba/nba_players.html\n",
      "数据 http://data.sports.sohu.com/nba/index.html\n",
      "深度 http://www.sohu.com/subject/315837\n",
      "CBA http://sports.sohu.com/s/cba\n",
      "男篮 http://sports.sohu.com/s/tcb\n",
      "女篮 http://cbachina.sports.sohu.com/s2011/tcbw/\n",
      "NBL http://sports.sohu.com/nbl/\n",
      "保罗：希望和哈登沟通过各自角色 休城时光有些灰暗 https://www.sohu.com/a/487971684_458722?scm=1004.728313668755980288.0.0.0\n",
      "深度：三巨头如虎添翼 悍将加盟成篮网新季强心针 https://www.sohu.com/a/487845721_458722?scm=1004.728313668755980288.0.0.0\n",
      "全明星阵容最强球队：新赛季湖人领衔 11年绿军第2 https://www.sohu.com/a/488008799_458722?scm=1004.728313668755980288.0.0.0\n",
      "东契奇：基德是位优秀教练 新赛季目标是竞争总冠军 https://www.sohu.com/a/488005318_458722?scm=1004.728313668755980288.0.0.0\n",
      "豪斯主动示好火箭盼长留 因号码危机卷入交易传闻 https://www.sohu.com/a/488001921_458722?scm=1004.728313668755980288.0.0.0\n",
      "詹皇声援德罗赞球衣退役 猛龙对洛瑞许诺也该给他？ https://www.sohu.com/a/487991983_458722?scm=1004.728313668755980288.0.0.0\n",
      "美媒对湖人三大预测：打不进西部前三 霍顿-塔克拿MIP https://www.sohu.com/a/487660780_458722?scm=1004.728313896368275456.0.0.0\n",
      "布克带詹娜回家见祖母 半路上目睹一场死亡车祸 https://www.sohu.com/a/487988328_458722?scm=1004.728313896368275456.0.0.0\n",
      "名嘴：掘金成湖人西部最大威胁 穆雷是最关键因素 https://www.sohu.com/a/487981122_458722?scm=1004.728313896368275456.0.0.0\n",
      "下赛季圆梦全明星8将：三球领衔 火箭20+10神塔上榜 https://www.sohu.com/a/487637485_458722?scm=1004.728313896368275456.0.0.0\n",
      "30天30队之森林狼：20状元成长喜人 求购本西组BIG4？ https://www.sohu.com/a/487978168_458722?scm=1004.728313896368275456.0.0.0\n",
      "小波特：每天都处在磨砺中 和球队续约谈判进展很好 https://www.sohu.com/a/487973711_458722?scm=1004.728313896368275456.0.0.0\n",
      "勇士三喜临门！汤神透露新造型，美媒建议追小加，官推晒格林数据 http://www.sohu.com/a/488010032_350497?scm=1019.s000a.v1.0\n",
      "勇士官推吹：锋线/中锋赛季场均8.9助就三次包括上赛季追梦 http://www.sohu.com/a/487987793_362070?scm=1019.s000a.v1.0\n",
      "NBA球员胜率排名！詹姆斯排第5，勇士3人上榜，第一很让人意外 http://www.sohu.com/a/487901097_647205?scm=1019.s000a.v1.0\n",
      "库里被打劫时代早已结束连续5年霸榜NBA最高年薪 http://www.sohu.com/a/487842867_458722?scm=1019.s000a.v1.0\n",
      "筹码4换1！乔治联手库里？勇士赶紧换啊！四巨头能炸翻NBA http://www.sohu.com/a/487706677_100005172?scm=1019.s000a.v1.0\n",
      "西蒙斯苦练三分也无用！曝交易谈判全搁浅，勇士只愿出嘴哥+选秀权 http://www.sohu.com/a/487647787_343291?scm=1019.s000a.v1.0\n",
      "西蒙斯下家赔率出炉：勇士第四，开拓者第一，上海大鲨鱼也来插足 http://www.sohu.com/a/487540161_387335?scm=1019.s000a.v1.0\n",
      "砸手里了！6换1，费城漫天要价！勇士放弃交易西蒙斯 http://www.sohu.com/a/487462432_522256?scm=1019.s000a.v1.0\n",
      "勇士记者：达伦-科里森已经去过勇士球馆和库里打过球 http://www.sohu.com/a/487391967_362070?scm=1019.s000a.v1.0\n",
      "又要大结局？四届全明星内线同意加盟篮网，湖人勇士痛失补强良机 http://www.sohu.com/a/487386956_100261089?scm=1019.s000a.v1.0\n",
      "全明星阵容最强球队：新赛季湖人领衔11年绿军第2 http://www.sohu.com/a/488008799_458722?scm=1019.s000a.v1.0\n",
      "湖人消息更新：掘金成最大威胁？主场门票发售，老詹声援德罗赞 http://www.sohu.com/a/488007456_350497?scm=1019.s000a.v1.0\n",
      "NBA-10人离开，1人退役，2人无球可打！湖人20年夺冠阵容大变脸 http://www.sohu.com/a/487994460_387335?scm=1019.s000a.v1.0\n",
      "名嘴：掘金成湖人西部最大威胁穆雷是最关键因素 http://www.sohu.com/a/487981122_458722?scm=1019.s000a.v1.0\n",
      "湖人不只汇集扣将，还有现役生涯常规赛总篮板前十中的五人 http://www.sohu.com/a/487968766_405481?scm=1019.s000a.v1.0\n",
      "慈世平一句话评价湖人！富尼耶做重要决定，老鹰旧将蝉联MVP http://www.sohu.com/a/487902670_647205?scm=1019.s000a.v1.0\n",
      "坚决不留！湖人夺冠赛季邀请他助阵遭拒绝，如今却厚着脸皮来试训 http://www.sohu.com/a/487891655_422959?scm=1019.s000a.v1.0\n",
      "NBA球队平均年龄：湖人31.5岁联盟最老篮网第三老雷霆最年轻 http://www.sohu.com/a/487895489_362070?scm=1019.s000a.v1.0\n",
      "维特斯晒瘦身成功照！发表感悟想从头再来，湖人最后一席有戏吗？ http://www.sohu.com/a/487863329_343291?scm=1019.s000a.v1.0\n",
      "一夜3消息！小前锋排名詹姆斯第3，MVP赔率第7，篮网官宣中锋离队 http://www.sohu.com/a/487806254_641575?scm=1019.s000a.v1.0\n",
      "论情商，KD和欧文堪称NBA版的卧龙凤雏，让人捉摸不透 http://www.sohu.com/a/486943580_425308?scm=1019.s000a.v1.0\n",
      "助湖人拿到第17冠，追平了老东家，隆多还要助湖人再拿第18冠？ http://www.sohu.com/a/486774222_445916?scm=1019.s000a.v1.0\n",
      "继续冲击总冠军，篮网试图续约三巨头，欧文发布有趣消息 http://www.sohu.com/a/485077473_551901?scm=1019.s000a.v1.0\n",
      "斯玛特正式与绿军续约：我们还有更多事情要完成 http://www.sohu.com/a/484908667_458722?scm=1019.s000a.v1.0\n",
      "只打六年也能退役球衣？加内特获至高荣耀，波士顿仁至义尽 http://www.sohu.com/a/484792074_528052?scm=1019.s000a.v1.0\n",
      "盘点休赛期令人意外操作：施罗德去凯尔特人第一，威少、保罗上榜 http://www.sohu.com/a/484574723_428396?scm=1019.s000a.v1.0\n",
      "凯尔特人提前续约斯玛特，对新来的施罗德意味着什么？ http://www.sohu.com/a/484233362_405481?scm=1019.s000a.v1.0\n",
      "刚刚传来2条坏消息！威少拒绝长留湖人，凯尔特人后悔留下斯玛特 http://www.sohu.com/a/484021504_100191063?scm=1019.s000a.v1.0\n",
      "与凯尔特人续约很睿智！塔图姆亲自庆祝兄弟留队，曾出身贫寒 http://www.sohu.com/a/484008710_100202021?scm=1019.s000a.v1.0\n",
      "效力球队7年，防守悍将获绿军大合同认可，施罗德想逆袭难了！ http://www.sohu.com/a/483923208_549701?scm=1019.s000a.v1.0\n",
      "快船被针对？新赛季5次7天5赛，湖人仅1次，NBA官方解释自相矛盾 http://www.sohu.com/a/487930776_561815?scm=1019.s000a.v1.0\n",
      "官宣！热火快船达成交易，克莱做出重大决定，戈贝尔兑现承诺 http://www.sohu.com/a/487567205_691253?scm=1019.s000a.v1.0\n",
      "重磅！快船终于签到人！三年只打1场NBA比赛！他也叫乔治啊！ http://www.sohu.com/a/487546167_100005172?scm=1019.s000a.v1.0\n",
      "回湖人就喷快船！第7位全明星真炸！颤抖吧NBA！有他直奔总冠军 http://www.sohu.com/a/487302703_100005172?scm=1019.s000a.v1.0\n",
      "西部休赛期运作评级：快船勇士获最高A-湖人仅为C http://www.sohu.com/a/487294696_458722?scm=1019.s000a.v1.0\n",
      "NBA公布西部球队实力排行榜！湖人第二，快船第五，第一让人意外 http://www.sohu.com/a/487049204_647205?scm=1019.s000a.v1.0\n",
      "拒绝上场！确定离队！1.7亿先生想去湖人、快船和勇士 http://www.sohu.com/a/487048651_522256?scm=1019.s000a.v1.0\n",
      "快船功勋隆多，为何选择重返湖人？球迷给出答案，且大局已定 http://www.sohu.com/a/487017681_100135484?scm=1019.s000a.v1.0\n",
      "NBA休赛期西部战力排行榜：湖人第二，快船第五，那榜首是哪支球队? http://www.sohu.com/a/486917284_387335?scm=1019.s000a.v1.0\n",
      "NBA版余则成！隆多一年赚1500万+把快船搞砸，底薪回到湖人 http://www.sohu.com/a/486813157_511145?scm=1019.s000a.v1.0\n",
      "NBA联盟各支球队平均年龄：湖人最老，雷霆最年轻，篮网排在第三 http://www.sohu.com/a/487924973_387335?scm=1019.s000a.v1.0\n",
      "官宣！篮网独行侠达成交易，火箭双喜临门，拉文做出重大决定 http://www.sohu.com/a/487960506_691253?scm=1019.s000a.v1.0\n",
      "NBA球队平均年龄：湖人31.5岁联盟最老篮网第三老雷霆最年轻 http://www.sohu.com/a/487895489_362070?scm=1019.s000a.v1.0\n",
      "NBA-官宣交易达成！篮网送别小乔丹，17人大名单更新，又变强了 http://www.sohu.com/a/487863169_387335?scm=1019.s000a.v1.0\n",
      "一夜3消息！小前锋排名詹姆斯第3，MVP赔率第7，篮网官宣中锋离队 http://www.sohu.com/a/487806254_641575?scm=1019.s000a.v1.0\n",
      "湖人为什么抢下被篮网抛弃的小乔丹，到底图什么？ http://www.sohu.com/a/487803680_405481?scm=1019.s000a.v1.0\n",
      "NBA新赛季夺冠赔率：篮网力压湖人高居榜首，勇士第四，雄鹿第三 http://www.sohu.com/a/487722930_387335?scm=1019.s000a.v1.0\n",
      "价值5330万？篮网摆脱小乔丹，小加替身到位，湖人三中锋超100岁 http://www.sohu.com/a/487768319_428396?scm=1019.s000a.v1.0\n",
      "坏消息，比尔前队友被开除！篮网前球员升职，科比女儿畅谈梦想 http://www.sohu.com/a/487765994_647205?scm=1019.s000a.v1.0\n",
      "阿德回归篮网还需裁两人！感谢活塞和马刺，夺冠概率升至第一 http://www.sohu.com/a/487705064_387335?scm=1019.s000a.v1.0\n",
      "NBA新赛季夺冠赔率：篮网力压湖人高居榜首，勇士第四，雄鹿第三 http://www.sohu.com/a/487722930_387335?scm=1019.s000a.v1.0\n",
      "美媒列各队实力档次：雄鹿+湖网日稳坐塔尖，快船无西决实力？ http://www.sohu.com/a/487074768_350497?scm=1019.s000a.v1.0\n",
      "NBA-新赛季三巨头分档：篮网雄鹿第一档，湖人第二档，勇士第三档 http://www.sohu.com/a/486372884_387335?scm=1019.s000a.v1.0\n",
      "东部休赛期实力榜：篮网居首雄鹿排名第二压老鹰 http://www.sohu.com/a/485962572_458722?scm=1019.s000a.v1.0\n",
      "恭喜勇士！恭喜76人！雄鹿续约冠军教头，利拉德做出表态 http://www.sohu.com/a/485936905_691253?scm=1019.s000a.v1.0\n",
      "官网休赛期东部战力榜：篮网第一雄鹿和老鹰分列二三位 http://www.sohu.com/a/485936896_362070?scm=1019.s000a.v1.0\n",
      "罗斯吐槽实力榜，湖人高了，雄鹿才是第二，字母哥是联盟新门面 http://www.sohu.com/a/485817766_561815?scm=1019.s000a.v1.0\n",
      "大O：霍勒迪改变了雄鹿气质是夺冠的关键人物 http://www.sohu.com/a/485756052_458722?scm=1019.s000a.v1.0\n",
      "留住基石！雄鹿与布登霍尔泽完成三年续约合同！目标下赛季能卫冕 http://www.sohu.com/a/485750348_543786?scm=1019.s000a.v1.0\n",
      "雄鹿主帅签下新合同！NBA最佳经纪人失误了，悍将把他告上法庭 http://www.sohu.com/a/485700953_647205?scm=1019.s000a.v1.0\n",
      "直播 http://data.sports.sohu.com/nba/nba_schedule_by_day.html\n",
      "赛程 http://data.sports.sohu.com/nba/nba_schedule_by_month.php?\n",
      "季后赛 http://sports.sohu.com/s/nba/playoffs\n",
      "排名 http://data.sports.sohu.com/nba/nba_teams_rank.html\n",
      "球队 http://data.sports.sohu.com/nba/nba_teams.html\n",
      "球员 http://data.sports.sohu.com/nba/nba_players.html\n",
      "数据 http://data.sports.sohu.com/nba/index.html\n",
      "深度 http://www.sohu.com/subject/315837\n",
      "CBA http://sports.sohu.com/s/cba\n",
      "男篮 http://sports.sohu.com/s/tcb\n",
      "女篮 http://cbachina.sports.sohu.com/s2011/tcbw/\n",
      "NBL http://sports.sohu.com/nbl/\n",
      "赛程 http://cbadata.sports.sohu.com/sch/\n",
      "视频 http://so.tv.sohu.com/list_p1165_p2165100_p3_p4_p5_p6_p7_p8_p9_p10_p11.html\n",
      "数据库 http://cbadata.sports.sohu.com/\n",
      "数据榜 http://cbadata.sports.sohu.com/ranking/players/1980/0/0\n",
      "NBA http://sports.sohu.com/s/nba\n",
      "男篮 http://sports.sohu.com/s/tcb\n",
      "WCBA http://sports.sohu.com/s/wcba\n",
      "NBL http://sports.sohu.com/nbl/\n",
      "苏群曝山东欠薪细节 最少的拖欠7个月哈德森第一个讨薪 https://www.sohu.com/a/487806257_461606?scm=1004.728314124706185216.0.0.0\n",
      "组图：林书豪晒健身器材 酒店隔离积极训练 https://www.sohu.com/picture/488033216?scm=1004.728314124706185216.0.0.0\n",
      "CBA全明星扣篮王离开上海 王潼成完全自由球员 https://www.sohu.com/a/487978660_461606?scm=1004.728314124706185216.0.0.0\n",
      "广州男篮官宣两小将上调一队 杜锋外甥身披15号球衣 https://www.sohu.com/a/487981595_461606?scm=1004.728314124706185216.0.0.0\n",
      "组图:赵继伟晒游玩帅照 卡丁车+实弹射击样样行 https://www.sohu.com/picture/487986611?scm=1004.728314124706185216.0.0.0\n",
      "林书豪晒训练器材：是时候开始今天的训练了 https://www.sohu.com/a/488033434_461606?scm=1004.728314124706185216.0.0.0\n",
      "沈梓捷生日遭赵睿蛋糕洗脸:我以为我躲过了(图) https://www.sohu.com/picture/487993324?scm=1004.728314329358860288.0.0.0\n",
      "四川男篮晒万人级别新主场照片:期待完美绽放(图) https://www.sohu.com/picture/487985393?scm=1004.728314329358860288.0.0.0\n",
      "中国轮椅女篮不敌荷兰队 获首枚残奥银牌创造历史 https://www.sohu.com/a/487806378_461606?scm=1004.728314329358860288.0.0.0\n",
      "曝CBA季前赛国庆节开赛 广东惠州为赛区之一 https://www.sohu.com/a/488010199_461606?scm=1004.728314329358860288.0.0.0\n",
      "CBA规定:若山东被书面催告后仍欠薪30天 球员有权解除合同 https://www.sohu.com/a/487808060_461606?scm=1004.728314329358860288.0.0.0\n",
      "曝9月7日CBA进行体测 CBA派专人飞往各队训练基地 https://www.sohu.com/a/487807314_461606?scm=1004.728314329358860288.0.0.0\n",
      "江南的城：沈阳能否成为CBA赛会制举办地联盟预计本周给出答复 http://www.sohu.com/a/488042138_362070?scm=1019.s000a.v1.0\n",
      "突尼斯男篮6战全胜卫冕非锦赛冠军梅杰里贡献22分6板1断3帽 http://www.sohu.com/a/488042067_362070?scm=1019.s000a.v1.0\n",
      "中国滑冰协会正式启动“三亿有我·滑起来”主题活动及系列公益活动 http://www.sohu.com/a/487785454_114977?scm=1019.s000a.v1.0\n",
      "3消息！周琦或联手宫鲁鸣，郭艾伦让位年轻人，山东男篮集体讨薪 http://www.sohu.com/a/487765937_647205?scm=1019.s000a.v1.0\n",
      "扎心互动！赵继伟调侃高诗岩被欠薪：听说你好几个月没工资？ http://www.sohu.com/a/487676227_553189?scm=1019.s000a.v1.0\n",
      "26岁还在冲击CBA！曾经的宁波大学是他的天下 http://www.sohu.com/a/487620811_99909989?scm=1019.s000a.v1.0\n",
      "正式复出！CBA名将重返深圳男篮，与沈梓捷再携手 http://www.sohu.com/a/487535837_120334142?scm=1019.s000a.v1.0\n",
      "刘传兴为何离开青岛男篮？合同年薪成真因，吴庆龙拒绝妥协 http://www.sohu.com/a/487533852_120463535?scm=1019.s000a.v1.0\n",
      "4消息！赵睿捣蛋鬼，沈阳申办成功，NBL接触贺天举，高速官宣接手 http://www.sohu.com/a/487250011_485972?scm=1019.s000a.v1.0\n",
      "NBA落选秀，落选CBA http://www.sohu.com/a/487153596_99909989?scm=1019.s000a.v1.0\n",
      "林书豪正式出院！与医护人员合影，穿定制T恤，晒隔离“大餐” http://www.sohu.com/a/487345514_511145?scm=1019.s000a.v1.0\n",
      "首钢官宣第二批注册球员两小将上调一队+签约新秀 http://www.sohu.com/a/486979867_461606?scm=1019.s000a.v1.0\n",
      "周琦能成为CBA版博斯曼吗？ http://www.sohu.com/a/486945727_138481?scm=1019.s000a.v1.0\n",
      "两败俱伤？周琦自废一年武功恐无济于事，交易成打破僵局唯一办法 http://www.sohu.com/a/486434407_138481?scm=1019.s000a.v1.0\n",
      "组图:雅尼斯解除隔离与首钢会合带队备战新赛季 http://www.sohu.com/a/485742196_461606?scm=1019.s000a.v1.0\n",
      "4消息！继伟调侃小高，体测确定，山东球员有望解约，广厦晒集训 http://www.sohu.com/a/487689877_485972?scm=1019.s000a.v1.0\n",
      "闹剧就此结束？小丁重返山东，但核心位置恐被此人彻底剥夺 http://www.sohu.com/a/486855829_100225154?scm=1019.s000a.v1.0\n",
      "中国男篮存在五个不会，杜锋指导已经做出改变，下届奥运会不是梦 http://www.sohu.com/a/486429875_120463520?scm=1019.s000a.v1.0\n",
      "浙江广厦球员注册信息公布广厦三少均顶薪续约3年 http://www.sohu.com/a/485969769_461606?scm=1019.s000a.v1.0\n",
      "连续续约10人！CBA又一土豪队直接给三巨头顶薪留住争冠班底 http://www.sohu.com/a/485899254_203884?scm=1019.s000a.v1.0\n",
      "赶紧去打NBA！99年新星真炸！再见周琦！他能成球队新老大 http://www.sohu.com/a/487956484_100005172?scm=1019.s000a.v1.0\n",
      "苏群曝山东欠薪细节最少的拖欠7个月哈德森第一个讨薪 http://www.sohu.com/a/487806257_461606?scm=1019.s000a.v1.0\n",
      "中国滑冰协会正式启动“三亿有我·滑起来”主题活动及系列公益活动 http://www.sohu.com/a/487785454_114977?scm=1019.s000a.v1.0\n",
      "苏群：山东队员在欠薪情况下没耽误比赛应该还他们公道 http://www.sohu.com/a/487636003_461606?scm=1019.s000a.v1.0\n",
      "欠薪一年！山东男篮全体队员实名讨薪签名+按手印 http://www.sohu.com/a/487634707_461606?scm=1019.s000a.v1.0\n",
      "CBA全明星扣篮王离开上海王潼成完全自由球员 http://www.sohu.com/a/487978660_461606?scm=1019.s000a.v1.0\n",
      "正式复出！CBA名将重返深圳男篮，与沈梓捷再携手 http://www.sohu.com/a/487535837_120334142?scm=1019.s000a.v1.0\n",
      "深圳男篮三人离队加盟NBL球队常亚松退役留队转型教练 http://www.sohu.com/a/487417352_461606?scm=1019.s000a.v1.0\n",
      "4消息！深圳挖4新星，女篮轮椅进决赛，赵睿太暖心，山东录制综艺 http://www.sohu.com/a/487386403_485972?scm=1019.s000a.v1.0\n",
      "早报|武磊与两品牌签下代言合同；武磊成为外星人补水推荐官 http://www.sohu.com/a/487229897_115533?scm=1019.s000a.v1.0\n",
      "整整退役5年！NBA状元终于回归篮球！新秀杜兰特真比不过他 http://www.sohu.com/a/487562989_100005172?scm=1019.s000a.v1.0\n",
      "回来了！力压杜兰特成为状元的球员！退役五年终于回归篮球 http://www.sohu.com/a/487272619_522256?scm=1019.s000a.v1.0\n",
      "周琦能成为CBA版博斯曼吗？ http://www.sohu.com/a/486945727_138481?scm=1019.s000a.v1.0\n",
      "中国女排传喜讯！自由人林莉的接班人浮出水面，不是王梦洁 http://www.sohu.com/a/486759894_100135484?scm=1019.s000a.v1.0\n",
      "上海更新注册球员名单王哲林顶薪签4年张知垚戴昊升一队 http://www.sohu.com/a/486748282_461606?scm=1019.s000a.v1.0\n",
      "刘传兴未与青岛续约将赴澳洲联赛青岛:尊重他选择 http://www.sohu.com/a/486876915_461606?scm=1019.s000a.v1.0\n",
      "官宣：赵戌宏加盟四川金强男篮 http://www.sohu.com/a/486742535_461606?scm=1019.s000a.v1.0\n",
      "周琦不差钱！身背多个大牌代言，手术后录制综艺，或死磕到底 http://www.sohu.com/a/486455542_511145?scm=1019.s000a.v1.0\n",
      "体育产业早餐8.28|C罗重回曼联签约两年曝快手拟与NBA战略合作 http://www.sohu.com/a/486206948_519172?scm=1019.s000a.v1.0\n",
      "青岛国信海天篮球俱乐部官宣：欢迎李原宇加入球队 http://www.sohu.com/a/486101765_461606?scm=1019.s000a.v1.0\n",
      "NBA http://sports.sohu.com/s/nba\n",
      "CBA http://sports.sohu.com/s/cba\n",
      "女篮 http://cbachina.sports.sohu.com/s2011/tcbw/\n",
      "NBL http://sports.sohu.com/nbl/\n",
      "高清：三对三篮球中国男篮绝杀波兰 颜鹏撞胸庆祝 https://www.sohu.com/picture/479757670?scm=1004.733273447903461376.0.0.0\n",
      "央视刘星宇解读3对3篮球：男女子状态良好 有望拿好成绩 https://www.sohu.com/a/479754280_461606?scm=1004.733273447903461376.0.0.0\n",
      "中国球员胡金秋场均10.3分 暂为三人篮球项目得分王 https://www.sohu.com/a/479754906_461606?scm=1004.733273447903461376.0.0.0\n",
      "2021男篮亚洲杯确定延期 改至2022年7月举办 https://www.sohu.com/a/479139490_461606?scm=1004.733273447903461376.0.0.0\n",
      "周琦总结奥运落选赛:看到与强队差距 还需加强身体对抗 https://www.sohu.com/a/477113556_461606?scm=1004.733273447903461376.0.0.0\n",
      "杜锋:多名球员带伤参加落选赛 两场比赛均有亮点表现 https://www.sohu.com/a/477115422_461606?scm=1004.733273447903461376.0.0.0\n",
      "日本海外组合归队备战奥运 八村塁将出战欧洲强队 https://www.sohu.com/a/477359519_461606?scm=1004.733273684474789888.0.0.0\n",
      "胡明轩:国际大赛身体对抗更强 需打法更聪明防守更专注 https://www.sohu.com/a/477118350_461606?scm=1004.733273684474789888.0.0.0\n",
      "赵继伟：希腊传导球出色值得学习 防守端应多呼应 https://www.sohu.com/a/477120871_461606?scm=1004.733273684474789888.0.0.0\n",
      "沈梓捷:自己在CBA挺有天赋 到了国际赛场毫无优势 https://www.sohu.com/a/477124823_461606?scm=1004.733273684474789888.0.0.0\n",
      "环球时报:男篮与世界强队差距越来越大 是否需要归化球员? https://www.sohu.com/a/476043631_461606?scm=1004.733273684474789888.0.0.0\n",
      "韩国小将获U19世界杯得分王 身高2米04场均25.6分 https://www.sohu.com/a/476877470_461606?scm=1004.733273684474789888.0.0.0\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "from collections import deque\n",
    "from urllib.parse import urljoin\n",
    "\n",
    "import requests\n",
    "\n",
    "# re.I makes the <li> match case-insensitive.\n",
    "LI_A_PATTERN = re.compile(r'<li class=\"item\">.*?</li>', re.I)\n",
    "A_TEXT_PATTERN = re.compile(r'<a\\s+[^>]*?>(.*?)</a>')\n",
    "A_HREF_PATTERN = re.compile(r'<a\\s+[^>]*?href=\"(.*?)\"\\s*[^>]*?>')\n",
    "\n",
    "\n",
    "def decode_page(page_bytes, charsets):\n",
    "    \"\"\"Decode the page with the first charset that works (None if all fail).\"\"\"\n",
    "    for charset in charsets:\n",
    "        try:\n",
    "            return page_bytes.decode(charset)\n",
    "        except UnicodeDecodeError:\n",
    "            pass\n",
    "\n",
    "\n",
    "def get_matched_parts(content_string, pattern):\n",
    "    \"\"\"Return every substring of content_string that matches pattern.\"\"\"\n",
    "    # Flags belong in re.compile(); a compiled pattern's findall() takes\n",
    "    # (string, pos, endpos), so passing re.I here would be read as pos=2.\n",
    "    return pattern.findall(content_string) \\\n",
    "        if content_string else []\n",
    "\n",
    "\n",
    "def get_matched_part(content_string, pattern, group_no=1):\n",
    "    \"\"\"Return one group of the first match of pattern, or None.\"\"\"\n",
    "    match = pattern.search(content_string)\n",
    "    if match:\n",
    "        return match.group(group_no)\n",
    "\n",
    "\n",
    "def get_page_html(seed_url, *, charsets=('utf-8', )):\n",
    "    \"\"\"Fetch the HTML of a page (None unless the status code is 200).\"\"\"\n",
    "    resp = requests.get(seed_url)\n",
    "    if resp.status_code == 200:\n",
    "        return decode_page(resp.content, charsets)\n",
    "\n",
    "\n",
    "def repair_incorrect_href(current_url, href):\n",
    "    \"\"\"Turn an extracted href into an absolute http(s) URL, or '' if not possible.\"\"\"\n",
    "    if href.startswith('//'):\n",
    "        href = urljoin('http://', href)\n",
    "    elif href.startswith('/'):\n",
    "        href = urljoin(current_url, href)\n",
    "    return href if href.startswith('http') else ''\n",
    "\n",
    "\n",
    "def start_crawl(seed_url, pattern, *, max_depth=-1):\n",
    "    \"\"\"Start crawling from seed_url (max_depth=-1 means no depth limit).\"\"\"\n",
    "    new_urls, visited_urls = deque(), set()\n",
    "    new_urls.append((seed_url, 0))\n",
    "    visited_urls.add(seed_url)\n",
    "    while new_urls:\n",
    "        current_url, depth = new_urls.popleft()\n",
    "        if depth != max_depth:\n",
    "            page_html = get_page_html(current_url, charsets=('utf-8', 'gbk'))\n",
    "            contents = get_matched_parts(page_html, pattern)\n",
    "            for content in contents:\n",
    "                text = get_matched_part(content, A_TEXT_PATTERN)\n",
    "                href = get_matched_part(content, A_HREF_PATTERN)\n",
    "                if href:\n",
    "                    href = repair_incorrect_href(current_url, href)\n",
    "                print(text, href)\n",
    "                if href and href not in visited_urls:\n",
    "                    # Record the URL when it is queued so it is fetched only once.\n",
    "                    visited_urls.add(href)\n",
    "                    new_urls.append((href, depth + 1))\n",
    "\n",
    "\n",
    "def main():\n",
    "    \"\"\"Entry point.\"\"\"\n",
    "    start_crawl(\n",
    "        seed_url='http://sports.sohu.com/nba_a.shtml',\n",
    "        pattern=LI_A_PATTERN,\n",
    "        max_depth=2\n",
    "    )\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    main()"
   ]
  },
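  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three regular expressions used above can be tried offline, without sending any request. The cell below is a minimal sketch rather than part of the original crawler: `SAMPLE_HTML` and `extract_links` are made-up names, and the fragment only assumes that each list item has the shape `<li class=\"item\"><a href=...>text</a></li>` that `LI_A_PATTERN` targets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Patterns with the same regular expressions as the crawler above.\n",
    "LI_A_PATTERN = re.compile(r'<li class=\"item\">.*?</li>')\n",
    "A_TEXT_PATTERN = re.compile(r'<a\\s+[^>]*?>(.*?)</a>')\n",
    "A_HREF_PATTERN = re.compile(r'<a\\s+[^>]*?href=\"(.*?)\"\\s*[^>]*?>')\n",
    "\n",
    "# Hypothetical HTML fragment, used here instead of a live page.\n",
    "SAMPLE_HTML = (\n",
    "    '<ul>'\n",
    "    '<li class=\"item\"><a href=\"/s/nba\" target=\"_blank\">NBA</a></li>'\n",
    "    '<li class=\"item\"><a href=\"//sports.example.com/cba\">CBA</a></li>'\n",
    "    '<li class=\"other\"><a href=\"/skip\">ignored</a></li>'\n",
    "    '</ul>'\n",
    ")\n",
    "\n",
    "\n",
    "def extract_links(html):\n",
    "    \"\"\"Return (text, href) pairs for every <li class=\"item\"> in html.\"\"\"\n",
    "    pairs = []\n",
    "    for item in LI_A_PATTERN.findall(html):\n",
    "        text = A_TEXT_PATTERN.search(item)\n",
    "        href = A_HREF_PATTERN.search(item)\n",
    "        if text and href:\n",
    "            pairs.append((text.group(1), href.group(1)))\n",
    "    return pairs\n",
    "\n",
    "\n",
    "links = extract_links(SAMPLE_HTML)\n",
    "for text, href in links:\n",
    "    print(text, href)"
   ]
  },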
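  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The examples above give an intuitive feel for what a crawler does. Keep the following points in mind when writing one:\n",
    "\n",
    "1. The code above uses the third-party `requests` library to fetch network resources. It is a high-quality library; see its [official documentation](https://requests.readthedocs.io/zh_CN/latest/) for usage.\n",
    "\n",
    "2. The code above uses a double-ended queue (`deque`) to hold the URLs waiting to be crawled. A `deque` behaves roughly like a `list` backed by linked storage: adding and removing elements at either end is efficient, which makes it a good fit for a FIFO (first in, first out) queue.\n",
    "\n",
    "3. Handle relative paths. A link extracted from a page is sometimes relative rather than a complete absolute URL, in which case it must be joined with a base URL (the `urljoin()` function in `urllib.parse` does this).\n",
    "\n",
    "4. Set up a proxy. Some sites restrict access by region (Netflix in the US, for example, has blocked access from many countries), and some crawlers need to hide their identity. In such cases you can route requests through a proxy server. Both free and paid commercial proxies exist; the latter are more stable and more available, and are strongly recommended for commercial projects. With the `requests` library, pass the `proxies` argument to the request method; with the standard library, use `ProxyHandler` from `urllib.request` to configure a proxy for your requests.\n",
    "\n",
    "5. Limit the download rate. A crawler that fetches pages too quickly risks being banned or being held liable for \"trespass to chattels\" (which can lead to a lawsuit that you lose). Adding a delay between two consecutive page requests throttles the crawler.\n",
    "\n",
    "6. Avoid crawler traps. Some sites generate page content dynamically, which can produce an unlimited number of pages (an online perpetual calendar, for instance, typically has an endless chain of links). One remedy is to track how many links were followed to reach the current page (the link depth); once the preset maximum depth is reached, the crawler stops adding that page's links to the queue.\n",
    "\n",
    "7. Avoid honeypot links. Some links on a site are invisible in the browser; they are usually honeypots placed deliberately to lure crawlers. Once such a link is visited, the server concludes that the request came from a crawler, which may get your IP address banned. How to avoid honeypot links is covered later."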
   ]
  },
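  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The points about the queue, relative links, rate limiting, and link depth can be combined in a small offline sketch. It is an illustration under stated assumptions rather than the crawler above: `FAKE_SITE` is a made-up in-memory link table standing in for real pages, and `crawl` and its `delay` parameter are hypothetical names. The sketch uses a `deque` as a FIFO queue, resolves relative links with `urljoin()`, records visited URLs in a set, and stops following links once `max_depth` is reached. `delay` marks where a pause between pages would go; it defaults to 0 so the cell runs instantly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from collections import deque\n",
    "from urllib.parse import urljoin\n",
    "\n",
    "# Hypothetical site: each URL maps to the hrefs found on that page.\n",
    "FAKE_SITE = {\n",
    "    'http://example.com/': ['/a', 'b.html', 'http://example.com/'],\n",
    "    'http://example.com/a': ['/a/next'],\n",
    "    'http://example.com/b.html': ['/a'],\n",
    "    'http://example.com/a/next': ['/a/next/next'],\n",
    "}\n",
    "\n",
    "\n",
    "def crawl(seed_url, max_depth, delay=0.0):\n",
    "    \"\"\"Breadth-first crawl of FAKE_SITE, at most max_depth links from the seed.\"\"\"\n",
    "    queue = deque([(seed_url, 0)])\n",
    "    visited = {seed_url}\n",
    "    order = []\n",
    "    while queue:\n",
    "        url, depth = queue.popleft()  # FIFO: the oldest URL comes out first\n",
    "        order.append(url)\n",
    "        if depth == max_depth:\n",
    "            continue  # depth limit reached: do not enqueue this page's links\n",
    "        for href in FAKE_SITE.get(url, []):\n",
    "            absolute = urljoin(url, href)  # resolve relative links\n",
    "            if absolute not in visited:\n",
    "                visited.add(absolute)\n",
    "                queue.append((absolute, depth + 1))\n",
    "        time.sleep(delay)  # throttle between pages\n",
    "    return order\n",
    "\n",
    "\n",
    "pages = crawl('http://example.com/', max_depth=2)\n",
    "print(pages)"
   ]
  },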
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "End of this section."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}